Kristian K Ullrich
Max Planck Institute for Evolutionary Biology
A matter of distance.
Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.
Typically used to:
An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.
MSA2dist Features:
From Bioconductor:
From GitHub:
Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.
Source: Pico-PLAZA 3.0
Where to find: the data/ directory in the GitHub repo associated with this presentation.
data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
codingsequences.rda: List of DNAStringSet objects.clusters.rda: Pre-calculated syntenet clusters.codingsequences [1] "Aprotothecoides" "Helicosporidiumsp" "Chlorellasp"
[4] "PicRCC4223" "PicSE3" "Asterochlorissp"
[7] "Csubellipsoidea" "Creinhardtii" "Vcarteri"
[10] "Bprasinos" "Otauri" "Osp"
[13] "Olucimarinus" "Omediterraneus" "Msp"
[16] "Mpusilla" "Ppatens"
DNAStringSet object of length 6:
width seq names
[1] 531 ATGGAGGGTGTGCACAGGCAGCT...CGGGCACTGCCACCCCCATCTGA AP00G00010
[2] 1542 ATGCTCGCAGTCCTGGCTCGCCG...TGCGCCGCAGGCTGCTGGGCTGA AP00G00020
[3] 1248 ATGGAGGCAGAGACGGCCCAGCC...GCAGCTGGAAGGCGAAGTCATAG AP00G00030
[4] 1086 ATGGAGGCTCTCTCCGGCACCCG...TGGCCGGCGCGAGCTCGGTGTGA AP00G00040
[5] 930 ATGAGCTTCGTGACGGTGGGAGG...GGAGGCTCTACCCCGGCATGTGA AP00G00050
[6] 831 ATGGCCAACCTTTTTATCAACCT...TGCCAGAGGAGATCATCCTGTGA AP00G00060
clusterslibrary(MSA2dist)
library(dplyr)
library(stringr)
# Get random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster) Gene Cluster
1 Bpra_BPRRCC1105_15G00170 15748
2 Osp_ORCC809_19G00990 15748
3 Oluc_OL19G00210 15748
4 Oluc_OL19G00220 15748
5 Omed_OM_19G00230 15748
6 Omed_OM_19G00240 15748
# Get species unique identifier
ab_len <- syntenet:::species_id_length(codingsequences)
s_abbrev <- substr(names(codingsequences), 1, ab_len)
# Fetch coding sequences
my_cds <- do.call(c, apply(
stringr::str_split_fixed(my_cluster$Gene, "_", 2), 1,
function(x) {
tmp<-codingsequences[s_abbrev==x[1]][[1]];
tmp[names(tmp)==x[2]]}))
my_cdsDNAStringSet object of length 10:
width seq names
[1] 1563 ATGGCTACTTATTGCGTCGGCGA...TCCAATGCCGACGGAAATGTAA BPRRCC1105_15G00170
[2] 3579 ATGCTGTGTGTCGGCGCCGTGTC...ACCCGTCGACGGCGCCGCGTAA ORCC809_19G00990
[3] 3276 ATGGGCGAAAGACGCGCGACGAC...CGTCACGTTGGATAGTCACTGA OL19G00210
[4] 1302 ATGGCTGTTGGTGATGACGTCGA...CGTCGCTCCGGAACCAGCGTAA OL19G00220
[5] 3126 ATGCGCGACGCGACGCGCGATGC...CGACGACGACGACGACGCGTAA OM_19G00230
[6] 1440 ATGTCTGCGGTGTATTGCGTGGA...CAAGCCGGTGCCGGTTGAATAA OM_19G00240
[7] 1905 ATGAAGGGGCACGCGACCCTTGA...CATGGAGCCCTCCAAGGTTTAA MRCC299_10G01060
[8] 1434 ATGGCCTCATGGTGCACCGCCGC...CTCCGCGTACAACAACGCATAA MP13G03670
[9] 1389 ATGTCGCTCACCGAATCTGGCGC...CGTCGCCCCGGAACCGGCGTAA OT_20G00200
[10] 3273 ATGGGAAACGGTCCCGAGGTCGA...GGAGGATTTCTCAGTCGTATGA OT_20G00190
To calculate a coding sequence alignment for two sequences:
# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_alnDNAStringSet object of length 2:
width seq names
[1] 3585 ATG---------------GCTAC...TGCCGACGGAA------ATGTAA BPRRCC1105_15G00170
[2] 3585 ATGCTGTGTGTCGGCGCCGTGTC...AACCCGTCGACGGCGCCGCGTAA ORCC809_19G00990
To calculate dN/dS from this alignment:
To calculate all pairwise combinations:
start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()
head(my_cds_kaks) Comp1 Comp2 seq1 seq2 Method Ka Ks
result.1 1 2 OL13G02250 OL13G02260 YN 0.916497 3.64537
result.2 1 3 OL13G02250 BPRRCC1105_04G02240 YN 0.187875 3.57243
result.3 1 4 OL13G02250 ORCC809_13G00730 YN 0.0406041 2.06945
result.4 1 5 OL13G02250 OL21G00740 YN 0.916497 3.64537
result.5 1 6 OL13G02250 OM_08G02050 YN 0.061288 3.46937
result.6 1 7 OL13G02250 MRCC299_06G05020 YN 0.150086 99
Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1 0.251414 4.97415e-11 612 129.087 482.913 NA
result.2 0.0525901 2.12165e-60 687 117.125 569.875 NA
result.3 0.0196207 1.05183e-59 690 124.832 565.168 NA
result.4 0.251414 4.97415e-11 612 129.087 482.913 NA
result.5 0.0176655 1.21514e-62 690 138.058 551.942 NA
result.6 0.00151602 0 687 100.966 586.034 NA
Substitutions S-Substitutions N-Substitutions
result.1 361 107.055 253.945
result.2 204 109.458 94.5418
result.3 110 87.6938 22.3062
result.4 361 107.055 253.945
result.5 135 102.529 32.4711
result.6 193 113.34 79.6598
Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1 NA NA
result.2 NA NA
result.3 NA NA
result.4 NA NA
result.5 NA NA
result.6 NA NA
Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1 1.49209 1.44681:1.44681:1:1:1:1
result.2 0.764902 0.795759:0.795759:1:1:1:1
result.3 0.407654 2.7026:2.7026:1:1:1:1
result.4 1.49209 1.44681:1.44681:1:1:1:1
result.5 0.743189 1.66709:1.66709:1:1:1:1
result.6 14.6777 0.848586:0.848586:1:1:1:1
GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.535276(0.483129:0.476994:0.645706) NA NA NA NA
result.2 0.384194(0.361502:0.336463:0.454617) NA NA NA NA
result.3 0.46732(0.401961:0.404139:0.595861) NA NA NA NA
result.4 0.535276(0.483129:0.476994:0.645706) NA NA NA NA
result.5 0.422089(0.394692:0.372432:0.499144) NA NA NA NA
result.6 0.477585(0.401302:0.392625:0.638829) NA NA NA NA
How long did it take?
Here, a pre-calculated MSA is necessary:
Plot average behavior of each codon:
Kristian K Ullrich @kullrich